Abstract

Coordinating agents to complete a set of tasks with intercoupled temporal and resource constraints is computationally challenging, yet human domain experts can solve these difficult scheduling problems using paradigms learned through years of apprenticeship. A process for manually codifying this domain knowledge within a computational framework is necessary to scale beyond the one-expert, one-trainee apprenticeship model. However, a human domain expert often has difficulty describing their decision-making process, causing the codification of this knowledge to become laborious. We propose a new approach for capturing domain-expert heuristics through a pairwise ranking formulation. Our approach is model-free and does not require enumerating or iterating through a large state-space.

We empirically demonstrate that this approach accurately learns multi-faceted heuristics on both a synthetic data set incorporating job-shop scheduling and vehicle routing problems and a real-world data set consisting of demonstrations of experts solving a variant of the weapon-to-target assignment problem. Our approach is able to learn scheduling policies of superior quality to those generated, on average, by human experts conducting an anti-ship missile defense task.

Introduction 

Optimization and scheduling of resources is a costly, challenging problem that affects almost every aspect of our lives. In healthcare, patients with non-urgent needs who experience prolonged wait times have higher rates of treatment noncompliance and missed appointments (Kehle et al. 2011; Pizer and Prentice 2011). In military engagements, the weapon-to-target assignment problem requires warfighters to deploy minimal resources to mitigate as many threats as possible while maximizing the duration of survival (Lee, Su, and Lee 2003). The problem of optimal task allocation and sequencing with upper- and lower-bound temporal constraints (i.e., deadlines and wait constraints) is NP-Hard (Bertsimas and Weismantel 2005), and real-world scheduling problems quickly become computationally intractable. However, human domain experts are able to learn from experience to develop strategies, heuristics and rules-of-thumb to effectively respond to these problems. The challenge we pose is to autonomously learn the strategies employed by these domain experts. This knowledge can be applied and disseminated more efficiently with such a model than with a single-expert, single-apprentice model.

Researchers have realized important progress toward capturing domain-expert knowledge from demonstration (Berry et al. 2011; Abbeel and Ng 2004; Konidaris, Osentoski, and Thomas 2011; Zheng, Liu, and Ni 2015; Odom and Natarajan 2015; Vogel et al. 2012; Ziebart et al. 2008). For example, in one recent work (Berry et al. 2011) an AI scheduling assistant, called PTIME, learns how users prefer to schedule events. PTIME can then propose scheduling changes when new events occur by solving an integer program. Two limitations to this work exist: PTIME requires users to explicitly rank their preferences over scheduling options to initialize the system, and PTIME uses a complete solver, which must consider an exponential number of options in the worst case.

Research aimed at capturing domain knowledge solely based on user demonstration led to the development of Inverse Reinforcement Learning (IRL) (Abbeel and Ng 2004; Konidaris, Osentoski, and Thomas 2011; Zheng, Liu, and Ni 2015; Odom and Natarajan 2015; Vogel et al. 2012; Ziebart et al. 2008). IRL serves the dual purpose of learning an unknown reward function for a given problem and learning a policy to optimize that reward function. However, there are two primary drawbacks to IRL for scheduling problems: computational tractability and the need for an environment model.

In the classical apprenticeship learning algorithm developed by Abbeel and Ng in 2004, one must solve a Markov Decision Process (MDP) repeatedly until a convergence criterion is satisfied. However, enumerating a large state-space, such as one found in large-scale scheduling problems involving hundreds of tasks and tens of agents, can quickly become computationally intractable due to memory limitations. Approximate dynamic programming approaches exist which essentially reformulate the problem as regression (Konidaris, Osentoski, and Thomas 2011; Mnih et al. 2015), yet the amount of data required to regress over a large state space remains challenging, and MDP-based scheduling solutions exist only for simple problems (Wu et al. 2011; Wang and Usher 2005; Zhang and Dietterich 1995).

IRL also requires a model of the environment for training. At its most basic, reinforcement learning uses a Markovian transition matrix that describes the probability of transitioning from an initial state to a subsequent state when taking a given action. For circumstances in which the environment dynamics are unknown or difficult to model within the constraints of a transition, researchers have developed Q-Learning and its variants, which have had much recent success (Mnih et al. 2015). However, these approaches require the ability to practice, or explore the state-space by querying a black-box emulator to solicit information about how taking a given action in a specific state changes that state.

Another effort has been to directly learn a function that maps states to actions (Chernova and Veloso 2007; Terrell and Mutlu 2012; Huang and Mutlu 2014). For example, Ramanujam and Balakrishnan trained a discrete-choice model using real data from air traffic controllers and showed how the model can accurately predict the correct runway configuration for an airport (Ramanujam and Balakrishnan 2011). Sammut et al. (Sammut et al. 1992) applied a decision tree model for an autopilot to learn to control an aircraft from expert demonstration. Action-driven learning techniques offer much promise for learning policies from expert demonstrators, but they have not been applied to complex scheduling problems. In order for these methods to succeed, one must model the scheduling problem in a way that allows for efficient computation of a scheduling policy.

In this paper, we propose a technique, which we call "apprenticeship scheduling," to capture this domain knowledge in the form of a scheduling policy. Our objective is to learn scheduling policies through expert demonstration and validate that schedules produced by the policies are of comparable quality to those generated by human or synthetic experts. Our approach efficiently utilizes domain-expert demonstrations without the need to train within an environment emulator. Rather than explicitly modeling a reward function and relying on dynamic programming or constraint solvers, which become computationally intractable for large-scale problems of interest, our objective is to use action-driven learning to extract the strategies of domain experts to efficiently schedule tasks.

The key to our approach is using pairwise comparisons between the actions taken (e.g., schedule agent $a$ to complete task $\tau_i$ at time $t$) and the set of actions not taken (e.g., unscheduled tasks at time $t$) to learn relevant model parameters and scheduling policies demonstrated by the training examples. We validate our approach using both a synthetic data set of solutions for a variety of scheduling problems, and a real-world data set of demonstrations from human experts solving a variant of the weapon-to-target assignment problem.

Preliminaries 

We aim to empirically demonstrate the generalizability of our learning approach through application to a variety of problem types. Korsah et al. provide a comprehensive taxonomy for classes of scheduling problems, which vary with formulation of constraints, variables, and objective or utility function (Korsah, Stentz, and Dias 2013). Within this taxonomy, there are four classes addressing interrelated utilities and constraints: No Dependencies (ND), In-Schedule Dependencies (ID), Cross-Schedule Dependencies (XD), and Complex Dependencies (CD). The ND problem class consists of independent tasks that must be assigned to agents, where the utility of one assignment does not affect the utility of other possible assignments. This class of problems is addressed, for example, in (Liu and Shell 2013). The ID class consists of problems in which the assignment of an agent to a task may affect the utility of other tasks assigned to that agent. This class is addressed, for example, in (Brunet, Choi, and How 2008) and (Nunes and Gini 2015). For XD class problems, making any assignment of an agent to a task can affect the utility of any other agent performing any other task, as addressed in (Gombolay, Wilcox, and Shah 2013). The CD class incorporates conditional constraints, where assigning an agent to a task affects the manner in which that task and other tasks can be performed. For example, in a rescue scenario, the travel routes available to fire trucks depend on the roads the bulldozers are assigned to clear (Jones, Dias, and Stentz 2011).

The Korsah et al. taxonomy also delineates between tasks that require one agent to perform, i.e. "single-agent tasks" (SA), and tasks requiring multiple agents, i.e. "multi-agent tasks" (MA). Agents that perform one task at a time are "single-task agents" (ST), and agents capable of performing multiple tasks at the same time are "multi-task agents" (MT). Lastly, the taxonomy distinguishes between instantaneous assignment (IA), in which all task and schedule commitments are made at the same time, versus time-extended assignment (TA), in which current and future commitments are planned.

In this work, we demonstrate our approach for two families of scheduling problems that span these classes. The first problem is the "Vehicle Routing Problem with Time Windows, Temporal Dependencies, and Resource Constraints (VRPTW-TDR)," which is an XD [ST-SA-TA] class problem. Depending on parameter selection, this family of problems encompasses the traveling salesman, job-shop scheduling, multi-vehicle routing, and multi-robot task allocation problems, among others. We consider agents to perform tasks sequentially (ST) and each task to require one agent (SA), with commitments made over time (TA). We also assume agents are heterogeneous in that they perform tasks at different rates. An agent that is incapable of performing a task is specified with a null completion rate. The objective is to minimize the makespan or other time-based performance measure. Agents and tasks have defined starting locations, and task locations are static. Each agent travels with a constant speed between task locations, and agents may only perform tasks when at the corresponding task location.

The second problem is the "Weapon-To-Target Assignment Problem (WTA)," which involves selecting weapons (e.g., a missile) to fire at targets (e.g., a stationary military compound). The canonical formulation as described in (Lee, Su, and Lee 2003) is in the ND class. However, we consider a more complex, CD [MT-MA-TA] class variant of the problem for anti-ship missile defense (ASMD). Here, one must determine how to deploy a set of soft kill weapons, also known as decoys, to distract an enemy's anti-ship missile from impacting one's own ship. These decoys are the agents, and the tasks are the neutralization of missiles. The effectiveness $E_f$ of deploying a decoy $a$ against target $\tau_j$ at a given location $x_i = [x, y, \theta]$ and time $t$ is dependent on the time history of all other decoy deployments $h$. Decoys can distract many missiles (MT), and many decoys can be used to distract the same missile along various points of its trajectory. Task allocation and scheduling commitments are made over time (TA). The key challenge of this problem is that the time history of how decoys have been deployed thus far affects the future effectiveness of decoys, where they should be deployed, and when they should be deployed. Agents and tasks have defined starting locations. Each task (i.e., missile) is modeled as a dynamical system with a homing function $F_{\tau}(h, t)$ that guides the missile towards its target and is a function of the current time and the time history $h$ of previous decoy deployments. Decoys travel with a constant speed to their target locations $x_g$ from the ship that deploys them.

Model for Apprenticeship Learning

In this section we present a framework for learning, via expert demonstration, a scheduling policy that correctly determines which task to schedule as a function of task state.

Many approaches to learning such models are based on Markov models, such as reinforcement learning or inverse reinforcement learning (Busoniu, Babuska, and De Schutter 2008; Barto and Mahadevan 2003; Konidaris and Barto 2007; Puterman 2014). These models, however, do not capture the temporal dependencies between states and are computationally intractable for large problem sizes. To determine which tasks to schedule at which times, we draw inspiration from the domain of web page ranking (Page et al. 1999), or predicting the most relevant web page in response to a search query. One important component of page ranking is capturing how pages relate to one another as a graph with nodes (i.e., web pages) and directed arcs (i.e., links between those pages) (Page et al. 1999). This connectivity is a suitable analogy for the complex temporal dependencies (i.e., precedence, wait, and deadline constraints) relating tasks in a scheduling problem.

Recent approaches to page ranking have focused on pairwise and listwise models, which have been shown to have advantages over pointwise models (Valizadegan et al. 2009). In listwise ranking, the goal is to generate a ranked list of web pages directly (Cao et al. 2007; Valizadegan et al. 2009; Volkovs and Zemel 2009), while a pairwise approach determines ranking based on pairwise comparisons between individual pages (Jin, Valizadegan, and Li 2008; Pahikkala et al. 2007). We chose the pairwise formulation to model the problem of predicting the best task to schedule at time $t$.

The pairwise model has key advantages over the listwise approach. First, classification algorithms (e.g., support vector machines) can be directly applied (Cao et al. 2007). Second, a pairwise approach is non-parametric, in that the cardinality of the input vector is not dependent upon the number of tasks (or actions) that can be performed in any instance. Third, training examples of pairwise comparisons in the data can be readily solicited. From a given observation in which a task was scheduled, we only know which task was most important, not the relative importance between all tasks. Thus, we create training examples based on pairwise comparisons between the scheduled and unscheduled tasks. A pairwise approach is most natural because we lack a context to determine the relative rank between two unscheduled tasks.

Consider a set of tasks, $\tau_i \in \boldsymbol{\tau}$, each of which has a set of real-valued features, $\gamma_{\tau_i}$. Each scheduling-relevant feature $\gamma^{j}_{\tau_i}$ may represent, for example, the deadline, the earliest time the task is available, the duration of the task, which resource $r$ is required by this task, etc. Next, consider a set of $m$ observations, $O = \{O_1, O_2, \ldots, O_m\}$. Observation $O_m$ consists of a feature vector $\{\gamma_{\tau_1}, \gamma_{\tau_2}, \ldots, \gamma_{\tau_n}\}$ describing the state of each task, the task scheduled by the expert demonstrator (including a null task, $\tau_\emptyset$, if no task was scheduled) and the time at which an action was taken. The goal is to then learn a policy that correctly determines which task to schedule as a function of task state.

We deconstruct the problem into two steps. Step 1: for each agent/resource pair, determine the candidate next task to schedule. Step 2: for each task, determine whether to schedule the task from the current state. In order to learn to correctly assign the next task to schedule, we transform each observation $O_m$ into a new set of observations by performing pairwise comparisons between the scheduled task $\tau_i$ and the set of tasks that were not scheduled (Equations 1-2). Equation 1 creates a positive example for each observation in which a task $\tau_i$ was scheduled. This example consists of the input feature vector, ${}^{\mathrm{rank}}\phi^{m}_{\langle \tau_i, \tau_x \rangle}$, and a positive label, $y^{m}_{\langle \tau_i, \tau_x \rangle} = 1$. Each element of the input feature vector ${}^{\mathrm{rank}}\phi^{m}_{\langle \tau_i, \tau_x \rangle}$ is computed as the difference between the corresponding values in the feature vectors $\gamma_{\tau_i}$ and $\gamma_{\tau_x}$, describing scheduled task $\tau_i$ and unscheduled task $\tau_x$. Equation 2 creates a set of negative examples with $y^{m}_{\langle \tau_x, \tau_i \rangle} = 0$. For the input vector, we take the difference of the feature values between unscheduled task $\tau_x$ and scheduled task $\tau_i$.
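As a minimal sketch of how Equations 1-2 turn one observation into training pairs, the Python below builds the difference vectors and their labels. The function name `pairwise_examples`, the dictionary layout, and the toy feature values are illustrative assumptions, not details given in the text.

```python
def pairwise_examples(features, scheduled, context):
    """Build (input vector, label) pairs for one observation.

    features:  dict mapping a task id to its list of task features (gamma)
    scheduled: id of the task the expert scheduled in this observation
    context:   list of high-level features for the task set (assumed layout)
    """
    examples = []
    g_i = features[scheduled]
    for task, g_x in features.items():
        if task == scheduled:
            continue
        # Equation 1: scheduled minus unscheduled, positive label.
        pos = list(context) + [a - b for a, b in zip(g_i, g_x)]
        # Equation 2: unscheduled minus scheduled, negative label.
        neg = list(context) + [b - a for a, b in zip(g_i, g_x)]
        examples.append((pos, 1))
        examples.append((neg, 0))
    return examples


# Toy observation: each task has (deadline, duration); task "a" was scheduled.
toy_features = {"a": [10.0, 2.0], "b": [15.0, 1.0], "c": [12.0, 4.0]}
pairs = pairwise_examples(toy_features, "a", context=[0.5])
```

Each observation with n tasks yields 2(n - 1) examples, and the input length depends only on the number of features, not on the number of tasks.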

This feature set is then augmented to capture additional contextual information important for scheduling, which may not be captured in examples consisting solely of differences between features of tasks. For example, one's scheduling policy may change based on the progress towards completion of the tasks, i.e. based on proportion of tasks completed so far. To provide this high-level information, we include the set of contextual, high-level features $\xi_\tau$ describing the set of tasks for observation $O_m$ in Equations 1-2. Prior work has shown that domain experts are adept at describing the features (both high-level, contextual and task-specific) used in their decision-making, yet it is more difficult for experts to describe how they reason about these features (Cheng, Wei, and Tseng 2006; Raghavan, Madani, and Jones 2006).

We can use these observations to train a classifier $f_{priority}(\tau_i, \tau_x) \in \{0, 1\}$ to predict whether it is better to schedule task $\tau_i$ as the next task rather than $\tau_x$. Given this pairwise classifier, we can determine which single task $\tau_i$ is the highest priority task $\tau^*$ according to Equation 3 by determining which task is most often higher priority in comparison to the other tasks in $\boldsymbol{\tau}$.
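The selection rule of Equation 3 amounts to counting pairwise wins. A minimal sketch, assuming a stand-in comparison function in place of a trained classifier (the earliest-deadline rule and the names below are hypothetical):

```python
def highest_priority(tasks, f_priority):
    """Return the task that wins the most pairwise comparisons."""
    def wins(t_i):
        # Skipping the self-comparison is a simplification of this sketch.
        return sum(f_priority(t_i, t_x) for t_x in tasks if t_x != t_i)
    return max(tasks, key=wins)


# Stand-in for a trained classifier: prefer the earlier deadline.
deadlines = {"a": 10, "b": 15, "c": 12}

def earlier_deadline(t_i, t_x):
    return int(deadlines[t_i] < deadlines[t_x])

best_task = highest_priority(list(deadlines), earlier_deadline)
```

Ties are broken here by list order, which is a choice of this sketch rather than something the text specifies.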


Next, we must learn to predict whether $\tau^*$ should be scheduled or the agent should remain idle. We train a second classifier, $f_{act}(\tau_i) \in \{0, 1\}$, which predicts whether or not $\tau_i$ should be scheduled. In our observations set, $O$, we only have examples in which a task was scheduled and those in which no task was scheduled. To train this classifier, we construct a new set of examples according to Equation 4 where positive labels are assigned to examples from $O_m$ in which a task was scheduled and negative labels to examples in $O_m$ in which no task was scheduled.

Finally, we construct a scheduling algorithm to act as an apprentice scheduler (Algorithm 1). Lines 1-2 iterate over each agent at each time step. In Line 3, the highest priority task $\tau^*$ is determined for a particular agent. In Lines 4-5, $\tau^*$ is scheduled iff $f_{act}(\tau^*)$ predicts that $\tau^*$ should be scheduled at the current time.

A benefit of the pairwise ranking formulation is that one can apply any one of a number of standard machine learning classification techniques to learn $f_{priority}(\tau_i, \tau_x)$ and $f_{act}(\tau_i)$. In our experimental evaluation, we compare the performance of a decision tree, support vector machine and other common classification techniques.
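The apprentice scheduler loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the two classifiers are hand-written stand-ins, a task is removed from the candidate pool once scheduled, and task durations and agent availability are ignored. None of these details come from the original pseudocode.

```python
def apprentice_scheduler(tasks, agents, horizon, f_priority, f_act):
    """Sketch of the apprentice scheduler with stand-in classifiers."""
    remaining = list(tasks)  # assumption: scheduled tasks leave the pool
    schedule = []
    for t in range(horizon + 1):          # Line 1: each time step
        for agent in agents:              # Line 2: each agent
            if not remaining:
                return schedule
            # Line 3: task that wins the most pairwise comparisons.
            best = max(
                remaining,
                key=lambda ti: sum(
                    f_priority(ti, tx) for tx in remaining if tx != ti
                ),
            )
            # Lines 4-5: schedule only if the second classifier agrees.
            if f_act(best, t) == 1:
                schedule.append((t, agent, best))
                remaining.remove(best)
    return schedule


# Hypothetical stand-ins: earliest deadline wins; act once a task is released.
deadline = {"a": 10, "b": 5}
release = {"a": 0, "b": 2}
plan = apprentice_scheduler(
    ["a", "b"], ["r1"], 3,
    f_priority=lambda ti, tx: int(deadline[ti] < deadline[tx]),
    f_act=lambda task, t: int(t >= release[task]),
)
```

In this toy run the agent idles until the highest-priority task becomes available, which mirrors the separation between choosing a candidate and deciding whether to act on it.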


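$${}^{\mathrm{rank}}\phi^{m}_{\langle \tau_i, \tau_x \rangle} := [\xi_\tau, \gamma_{\tau_i} - \gamma_{\tau_x}], \quad y^{m}_{\langle \tau_i, \tau_x \rangle} = 1, \quad \forall \tau_x \in \boldsymbol{\tau} \setminus \tau_i, \ \forall O_m \in O \mid \tau_i \text{ scheduled in } O_m \qquad (1)$$

$${}^{\mathrm{rank}}\phi^{m}_{\langle \tau_x, \tau_i \rangle} := [\xi_\tau, \gamma_{\tau_x} - \gamma_{\tau_i}], \quad y^{m}_{\langle \tau_x, \tau_i \rangle} = 0, \quad \forall \tau_x \in \boldsymbol{\tau} \setminus \tau_i, \ \forall O_m \in O \mid \tau_i \text{ scheduled in } O_m \qquad (2)$$

$$\tau^* = \arg\max_{\tau_i \in \boldsymbol{\tau}} \sum_{\tau_x \in \boldsymbol{\tau}} f_{priority}(\tau_i, \tau_x) \qquad (3)$$

$$\phi^{m}_{\tau_i} := [\xi_\tau, \gamma_{\tau_i}], \quad y^{m}_{\tau_i} = \begin{cases} 1 : \tau_i \text{ scheduled in } O_m \wedge \tau_i \text{ scheduled in } O_{m+1} \\ 0 : \tau_\emptyset \text{ scheduled in } O_m \end{cases} \qquad (4)$$

Algorithm 1 Pseudocode for an Apprentice Scheduler
ApprenticeScheduler($\boldsymbol{\tau}$, $A$, $TC$, $\tau_R$)
1: for $t = 0$ to $T$ do
2:   for all agents $a \in A$ do
3:     $\tau^* \leftarrow \arg\max_{\tau_i \in \boldsymbol{\tau}} \sum_{\tau_x \in \boldsymbol{\tau}} f_{priority}(\tau_i, \tau_x)$
4:     if $f_{act}(\tau^*) == 1$ then
5:       Schedule $\tau^*$
6:     end if
7:   end for
8: end for

Data Sets

Next, we validate that schedules produced by our learned policies are of comparable quality to those generated by human or synthetic experts.

Synthetic Data Set

First we generated a synthetic data set in which schedules were produced through application of context-dependent scheduling heuristics. Our objective was to show that our technique learns both the heuristics and policy for their correct application. We constructed a synthetic data set based on the Vehicle Routing Problem with Time Windows, Temporal Dependencies, and Resource Constraints (VRPTW-TDR). Problems involved two heterogeneous agents and 20 partially ordered tasks located within a 20 x 20 grid.

We constructed a mock heuristic to serve as our source for synthetic-expert demonstrations, as shown in Algorithm 2. Our heuristics were based on our prior work in scheduling (Tan et al. 2001; Gombolay, Wilcox, and Shah 2013) and prior work addressing the vehicle routing problem with time windows (Solomon 1987). In Lines 1-6, the algorithm collects all alive and enabled tasks $\tau_i \in \tau_{AE}$ as defined by (Muscettola, Morris, and Tsamardinos 1998). Consider a pair of tasks $\tau_i$ and $\tau_j$, with start and finish times $s_i, f_i$ and $s_j, f_j$, respectively, such that there is a wait constraint requiring $\tau_i$ to start at least $W_{\langle \tau_j, \tau_i \rangle}$ units of time after $\tau_j$. A task $\tau_i$ is alive and enabled if $t \geq f_j + W_{\langle \tau_j, \tau_i \rangle}$ for all such $\tau_j$ and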


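Algorithm 2 Pseudocode for the Mock Heuristic
MockHeuristic($\boldsymbol{\tau}$, $A$, $TC$, $\tau_R$)
1: $\tau_{AE} \leftarrow$ initialize alive and enabled task set
2: for all $\tau_i \in \boldsymbol{\tau}$ do
3:   if all wait constraints for $\tau_i$ have been satisfied then
4:     $\tau_{AE} \leftarrow \tau_{AE} \cup \tau_i$
5:   end if
6: end for
7: for all agents $a \in A$ do
8:   if Speed $< 1$ // Vehicle Routing Problem then
9:     $l_x \leftarrow$ location of $\tau_x$
10:     $l_a \leftarrow$ location of agent $a$
11:     $\theta_{xa} \leftarrow \arccos\left( \frac{l_x \cdot l_a}{\lVert l_x \rVert \, \lVert l_a \rVert} \right)$
12:     $\tau^* \leftarrow \arg\min_{\tau_x \in \tau_{AE}} \big( \lVert l_x - l_a \rVert \ldots$
13:       $\ldots + c\,\theta_{xa} \big)$
14:   else if $\ldots\ c$ // Resource Contention Mode then
15:     $\tau^* \leftarrow \arg\max_{\tau_x \in \tau_{AE}} \big( \sum 1_{R_{\tau_x} = R_{\tau_j}} - \ldots \big)$
16:   else
17:     $\tau^* \leftarrow \arg\min_{\tau_x \in \tau_{AE}} d_{\tau_x}$
18:   end if
19:   if $R_{\tau^*}$ is unoccupied at time $t$ then
20:     if agent $a$ could travel to reach $\tau^*$ by time $t$ then
21:       Schedule $\tau^*$
22:     end if
23:   end if
24: end for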

Next, the heuristic iterates over each agent and task to find the highest-priority task $\tau^*$ to schedule for each agent. In Lines 7-18, the algorithm determines which heuristic is most appropriate to apply. If agent speed is sufficiently slow, travel time will become the major bottleneck. If the agents are fast but one or more resources are heavily utilized, use of these resources can become the bottleneck. Otherwise, task durations and associated wait constraints are generally most important.

In Line 8, the algorithm identifies travel distance as the most important bottleneck and switches to a heuristic well-suited for vehicle routing that minimizes a weighted, linear combination of features (Gambardella, Eric Taillard, and Agazzi 1999; Solomon 1987) comprised of the distance and angle relative to the origin between agent $a$ and $\tau_x$ and an indicator term for whether $\tau_x$ must be executed to satisfy a wait constraint for another task $\tau_j$. This rule is based on prior work on the vehicle routing problem (Gambardella, Eric Taillard, and Agazzi 1999; Solomon 1987) and on a heuristic proposed to mitigate resource-contention in multi-robot, multi-resource problems (Gombolay, Wilcox, and Shah 2013). In Line 14, the algorithm determines that there may be a resource bottleneck and tries to alleviate it by switching to a resource-contention mode and applying a heuristic that returns the task $\tau^* \in \tau_{AE}$ that maximizes a weighted, linear combination of the commonality of the task's required resource less its deadline. If neither travel distance nor resource contention is perceived as the major bottleneck, the algorithm switches to applying an Earliest Deadline First rule (Line 16), which performs well across many scheduling domains (Chen et al. 2014; Gombolay et al. 2013). If the resource required for $\tau^*$, $R_{\tau^*}$, is idle (Line 19), and the agent is able to reach the task by time $t$ (Line 20), then the heuristic schedules task $\tau^*$ at time $t$ (Line 21). We note that an agent is able to reach task $\tau^*$ if $t \geq f_j + k\,(x_i - x_j) / \lVert x_i - x_j \rVert$ for all $\tau_j \in \boldsymbol{\tau}$ that the agent has already completed, where $k$ is the agent's speed.
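The three-way mode switch described above can be illustrated with a short Python sketch. The thresholds, weights, dictionary keys, and the way contention is triggered are all hypothetical choices made for illustration, and the wait-constraint indicator term is omitted. This is not the authors' implementation.

```python
import math


def pick_task(agent_loc, speed, tasks, speed_threshold=1.0,
              contention_threshold=2, w_angle=1.0, w_deadline=0.1):
    """Choose a task using one of three rules (all constants assumed).

    tasks: list of dicts with keys 'id', 'loc', 'deadline', 'resource'.
    """
    def angle(u, v):
        # Angle between two location vectors, measured from the origin.
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0 or nv == 0:
            return 0.0
        cos = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
        return math.acos(max(-1.0, min(1.0, cos)))

    def sharing(task):
        # Number of tasks (including this one) needing the same resource.
        return sum(1 for o in tasks if o["resource"] == task["resource"])

    if speed < speed_threshold:
        # Travel-bound mode: weighted distance plus angle.
        return min(tasks, key=lambda x: math.dist(x["loc"], agent_loc)
                   + w_angle * angle(x["loc"], agent_loc))
    if max(sharing(x) for x in tasks) >= contention_threshold:
        # Resource-contention mode: resource commonality less deadline.
        return max(tasks, key=lambda x: sharing(x)
                   - w_deadline * x["deadline"])
    # Otherwise: Earliest Deadline First.
    return min(tasks, key=lambda x: x["deadline"])


toy_tasks = [
    {"id": "A", "loc": (1, 0), "deadline": 4, "resource": 1},
    {"id": "B", "loc": (5, 5), "deadline": 9, "resource": 2},
    {"id": "C", "loc": (2, 2), "deadline": 12, "resource": 2},
]
slow_choice = pick_task((1, 1), 0.5, toy_tasks)["id"]
contended_choice = pick_task((1, 1), 2.0, toy_tasks)["id"]
edf_choice = pick_task((1, 1), 2.0, toy_tasks, contention_threshold=3)["id"]
```

The same task set yields a different choice in each mode, which is the context-dependence the learned policy is meant to recover.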

With the heuristic shown in Algorithm 2, we generated a set of training data incorporating 30,000 task sets: 10,000 for each type of bottleneck identified by the heuristic. A spectrum of problems (i.e. traveling salesman, job-shop scheduling, multi-vehicle routing) was represented, as task locations, agent travel speeds and task completion rates were varied. These task sets provided 533,737 observations. In 96% of the observations, the mock heuristic idled (i.e., chose not to schedule a task), and in 4% of the observations, the mock heuristic scheduled an agent to complete a task.

Real-World Data Set

We collected a real-world data set consisting of human demonstrators of various skill levels solving the ASMD weapon-to-target assignment problem. We utilized a virtual gaming environment requiring players to manage a set of heterogeneous decoys to defeat raids of heterogeneous enemy anti-ship missiles. We modeled a scenario with five types of decoys and ten types of threats. The threats were randomly generated for each played scenario, thereby promoting the development of strategies that were robust to a distribution of threat scenarios. Each decoy had a specified effectiveness against each threat type. Players attempted to deploy a set of decoys of the correct decoy types, at the right location, and at the right times in order to distract incoming missiles. Threats were launched over time, meaning an effective deployment at time $t$ could become counterproductive at a future point in time as new enemy missiles were launched. Games were scored as follows: 10,000 points were received each time a threat was neutralized and 2 points for each second each threat spent homing in on a decoy. 5,000 points were subtracted for each threat impact and 1 point for each second each threat spent homing in on one's own ship. Lastly, 25-1,000 points were subtracted for each deployment of a decoy, depending on the type.
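The scoring rule is simple arithmetic, and a short Python sketch makes the trade-offs concrete. The function and argument names are illustrative, and the example counts are made up rather than taken from the collected games.

```python
def game_score(neutralized, impacts, secs_on_decoy, secs_on_ship, decoy_costs):
    """Score one game using the point values stated in the text.

    decoy_costs: per-deployment costs, each between 25 and 1,000 points
                 depending on decoy type (values here are assumed inputs).
    """
    return (
        10_000 * neutralized    # reward per neutralized threat
        + 2 * secs_on_decoy     # reward per threat-second homing on a decoy
        - 5_000 * impacts       # penalty per threat impact
        - 1 * secs_on_ship      # penalty per threat-second homing on the ship
        - sum(decoy_costs)      # cost of every decoy deployed
    )


# Hypothetical game: 3 threats neutralized, 1 impact, 3 decoys deployed.
example_score = game_score(
    neutralized=3, impacts=1, secs_on_decoy=40, secs_on_ship=25,
    decoy_costs=[25, 1_000, 250],
)
```

For this made-up game the total is 30,000 + 80 - 5,000 - 25 - 1,275 = 23,780 points.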

The collected data set consisted of 311 games played by 35 human players across 45 threat configurations or "scenarios". We sub-selected sixteen threat configurations such that each configuration had at least one human demonstration that mitigated all enemy missiles. For these sixteen threat configurations, there were 162 total games played by 27 unique human demonstrators. Players consisted of technical fellows and associates as well as contractors at MIT Lincoln Laboratory, and their expertise varied from "generally knowledgeable about the ASMD problem" to "domain experts" with professional experience or training in anti-ship missile defense.

Empirical Evaluation

In this section, we evaluate our prototype for apprenticeship scheduling on the synthetic and real-world data sets.

Synthetic Data Set

We trained our model using a decision tree, KNN classifier, logistic regression (logit) model, support vector machine with a radial basis function kernel (SVM-RBF) and a neural network to learn $f_{priority}(\cdot, \cdot)$ and $f_{act}(\cdot)$. We randomly sampled 85% of the data for training and 15% for testing. We defined the features as follows: The high-level feature vector of the task set, $\xi_\tau$, is comprised of the agents' speed and the degree of resource contention $\sum 1_{R_{\tau_i} = R_{\tau_j}}$. The task-specific feature vector $\gamma_{\tau_i}$ is comprised of the task's deadline, a binary indicator for whether or not the task's precedence constraints have been satisfied, the number of tasks sharing this task's resource, a binary indicator for whether or not this task's resource is available, the travel time remaining to reach the task, the distance agent $a$ would travel to reach $\tau_i$ and the angular difference between the vector describing the location of agent $a$ and the vector describing the position of $\tau_i$ relative to agent $a$.

We compared the performance of our pairwise approach
with a point-wise approach and a naive approach. In the
point-wise approach, training examples for selecting the
highest priority task were of the form
$\phi^m_{\tau_i} := [\xi_\tau, \gamma_{\tau_i}]$;
the label $y^m_{\tau_i}$ was equal to 1 if task $\tau_i$ was
scheduled in observation $m$, and was 0 otherwise. In the naive
approach, examples were comprised of an input vector that
concatenates the high-level features of the task set and the
task-specific features, of the form
$\phi^m := [\xi_\tau, \gamma_{\tau_1}, \gamma_{\tau_2}, \ldots, \gamma_{\tau_n}]$;
labels $y^m$ are equal to the index of the task $\tau_i$
scheduled in observation $m$.
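The three example constructions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, feature vectors are plain Python lists, and the pairwise form (scheduled-minus-unscheduled features labeled 1, the reversed difference labeled 0) is an assumption about how the comparison between scheduled and unscheduled tasks is encoded.

```python
from typing import List, Tuple

Vector = List[float]


def pairwise_examples(xi: Vector, gammas: List[Vector],
                      scheduled: int) -> List[Tuple[Vector, int]]:
    """Assumed pairwise form: one positive and one negative example
    for every unscheduled task, built from feature differences."""
    examples = []
    g_i = gammas[scheduled]
    for x, g_x in enumerate(gammas):
        if x == scheduled:
            continue
        diff = [a - b for a, b in zip(g_i, g_x)]
        examples.append((xi + diff, 1))                # scheduled vs. other
        examples.append((xi + [-d for d in diff], 0))  # other vs. scheduled
    return examples


def pointwise_examples(xi: Vector, gammas: List[Vector],
                       scheduled: int) -> List[Tuple[Vector, int]]:
    """Point-wise form: one example per task, labeled 1 only for the
    task that was scheduled in this observation."""
    return [(xi + g, 1 if i == scheduled else 0)
            for i, g in enumerate(gammas)]


def naive_example(xi: Vector, gammas: List[Vector],
                  scheduled: int) -> Tuple[Vector, int]:
    """Naive form: a single example concatenating every task's
    features, labeled with the index of the scheduled task."""
    flat = [v for g in gammas for v in g]
    return (xi + flat, scheduled)
```

With n tasks in an observation, the pairwise construction in this sketch yields 2(n - 1) examples, the point-wise construction yields n, and the naive construction yields one example whose length grows with n.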

Table 1 depicts the sensitivity (i.e., true positive rate) and
specificity (i.e., true negative rate) of the model. We found
that a pairwise model outperformed the point-wise and naive
approaches. Within the pairwise model, a decision tree
provided the best performance: The trained decision tree was
able to identify the correct task and when to schedule that
task 95% of the time, and was able to accurately predict
when no task should be scheduled 96% of the time.

                 Pair-Wise      Point-Wise     Naive
Decision Tree    95%/96.0%      29.4%/99.3%    2.81%/70.6%
KNN              37.5%/64.8%    4.89%/27.3%    7.87%/68.5%
Logit            73.8%/69.6%    4.35%/67.7%    -/-
SVM-RBF          4.38%/99.3%    2.17%/99.6%    1.69%/99.4%
Neural Net       71.9%/60.7%    0.00%/99.9%    2.81%/94.3%
Random           6.49%/50.0%    6.49%/50.0%    6.49%/50.0%

Table 1: Sensitivity/specificity for machine learning
techniques using the pair-wise, point-wise, and naive
approaches.
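For reference, sensitivity and specificity can be computed from binary labels and predictions as in the sketch below. The function name and the 0/1 label encoding are assumptions made for illustration; the source does not specify how the metrics were implemented.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against a class that never appears in y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Reporting both numbers matters here because the classes are imbalanced: a model that rarely predicts a positive can reach very high specificity while its sensitivity stays near zero.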

We sought to more fully understand the performance of
a decision tree trained with a pairwise model as a function
of the number and quality of training examples, as shown in
Table 2. We trained decision trees with our pairwise model
with 15, 150, and 1,500 demonstrations. The sensitivity and
specificity reported in Table 2 for 15 and 150 demonstrations
are the average sensitivity and specificity of ten models
trained via random sub-sampling without replacement.
We also varied the quality of the training examples, assuming
the demonstrator was operating under an ε-greedy approach
with a probability of (1 − ε) of selecting the correct
task to schedule and selecting another task from a uniform
distribution otherwise. This assumption is conservative; a
demonstrator making an error would be more likely to pick
the second- or third-best task than to select a task at random.
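The ε-greedy demonstrator model described above can be sketched as follows. The function name and signature are hypothetical, and the sketch assumes that "another task" means a task other than the correct one, drawn uniformly.

```python
import random
from typing import List


def epsilon_greedy_choice(correct: int, tasks: List[int],
                          epsilon: float, rng: random.Random) -> int:
    """With probability (1 - epsilon) return the correct task;
    otherwise return a different task chosen uniformly at random."""
    others = [t for t in tasks if t != correct]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return correct


# Usage: simulate a demonstrator who is correct about 90% of the time.
rng = random.Random(0)
choices = [epsilon_greedy_choice(2, [0, 1, 2, 3, 4], 0.1, rng)
           for _ in range(1000)]
```

Because errors are spread uniformly over all remaining tasks rather than concentrated on near-optimal ones, this noise model is harsher than the mistakes a real demonstrator would be expected to make.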

Training a model based on pairwise comparison between
the scheduled task and unscheduled tasks effectively
produced policies of comparable quality to those generated by
the synthetic expert. The decision tree model performed
well due to the modal nature of the multi-faceted scheduling
heuristic. We note that this data set was composed of
scheduling strategies with mixed discrete-continuous
functional components, and in future work, performance can be
further improved by combining decision trees with logistic
regression. This hybrid learning approach has seen success
in machine learning classification tasks (Landwehr, Hall,
and Frank 2005), and can be readily applied to this
apprenticeship scheduling framework.

                   Proportion of Correct Demonstrations (1 − ε)
Number of
Demonstrations     100%           90%            80%
1,500              93.0%/91.0%    84.0%/90.0%    76.0%/87.0%
150                91.6%/91.0%    79.5%/87.6%    67.6%/90.7%
15                 77.2%/91.8%    71.8%/91.8%    59.2%/73.7%

Table 2: Sensitivity/specificity for a pair-wise decision tree,
varying the number and proportion of correct demonstrations.

Real-World Data Set

We trained and tested a decision tree on our pairwise
scheduling model via leave-one-out cross-validation using
the sixteen real demonstrations in which a player mitigated
all enemy missiles. Each demonstration came from a unique
threat scenario. Features for each decoy/missile pair (or
null decoy deployment from inaction) included indicators
for whether the decoy had been placed such that the missile
was successfully distracted by that decoy, whether the
missile would be lured into hitting the ship by the decoy
placement, or whether the missile would be unaffected by
the placement. Across all sixteen scenarios, the mean of
players' scores was 74,728 with a standard deviation of
26,824. With merely 15 examples of expert human demonstrations,
our apprenticeship scheduling model was able to achieve an
average score of 87,540 with a standard deviation of 16,842.
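The leave-one-out protocol explains why each learned policy sees only 15 demonstrations: with sixteen scenarios, every fold holds one out for testing and trains on the rest. A minimal sketch of the fold construction follows; the function name is hypothetical, and the training and scoring steps are omitted because the source does not detail them.

```python
from typing import Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def leave_one_out(items: List[T]) -> Iterator[Tuple[List[T], T]]:
    """Yield (training items, held-out item) for every item in turn."""
    for i, held_out in enumerate(items):
        yield items[:i] + items[i + 1:], held_out


# Usage: sixteen scenario identifiers give sixteen folds of 15 + 1.
scenarios = list(range(16))
folds = list(leave_one_out(scenarios))
```

Each policy is therefore evaluated on a threat scenario it never saw during training.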

We performed statistical analysis to evaluate our hypothesis
that the scores produced by the learned policy were
statistically significantly better than the average scores achieved
by the human demonstrators. The null hypothesis stated
that the number of scenarios in which the apprenticeship
scheduling model achieved superior performance was less
than or equal to the number of scenarios in which the average
score of the human demonstrators was superior to the
apprenticeship scheduler. We set the significance level at
α = 0.05. Application of a binomial test* rejects the null
hypothesis, meaning that the learned scheduling policy
performed better than the human demonstrators on statistically
significantly more scenarios (12 versus 4 scenarios), with
p = 0.011. This promising result was achieved with a relatively
small training set and indicates that the learned policy
can form the basis for a training tool to improve the average
player's score.
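The reported p-value can be checked with a short binomial computation. The sketch below rests on two assumptions that the garbled footnote does not let us confirm directly: the success probability under the null hypothesis is p = 0.5, and the probability is one minus the cumulative binomial sum taken from 0 through a. Under those assumptions, 12 versus 4 scenarios gives a value that rounds to the reported 0.011.

```python
from math import comb


def binomial_test_value(a: int, b: int, p: float = 0.5) -> float:
    """Return 1 - sum_{i=0}^{a} C(a+b, i) p^i (1-p)^(a+b-i).

    a: scenarios where the learned policy beat the average human score.
    b: scenarios where the average human score was better.
    p: assumed success probability under the null hypothesis.
    """
    n = a + b
    cdf = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(a + 1))
    return 1.0 - cdf


print(round(binomial_test_value(12, 4), 3))  # prints 0.011
```

The exact value under these assumptions is 697/65536, or about 0.0106, which is below the stated significance level of 0.05.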

Conclusions 

We propose a technique for apprenticeship scheduling that
relies on a pairwise comparison of scheduled and unscheduled
tasks to learn a model for task prioritization. We validate
our apprenticeship scheduling algorithm on both a synthetic
data set covering a variety of scheduling problems
with lower- and upper-bound temporal constraints, resource
constraints, and travel distance considerations, as well as a
real-world data set where human demonstrators solved a
variant of the weapon-to-target assignment problem. Our
approach is able to learn scheduling policies of superior
quality to those generated, on average, by human experts
conducting an anti-ship missile defense task.


*The probability of rejecting the null hypothesis is
$1 - \sum_{i=0}^{a} \binom{a+b}{i} p^{i} (1-p)^{a+b-i}$,
where $p = 0.5$ and where $a$ and $b$ are the number of
scenarios in which the apprenticeship scheduling model
performed better than the average human score, and vice versa,
respectively.